Exploratory Data Analysis will be performed on a dataset describing the hourly count of rental bikes between the years 2011 and 2012 for the Capital Bikeshare Program in Washington, D.C. The dataset contains bike ridership information as well as corresponding weather data.
Afterwards, using the attributes provided and the findings of our analysis, we will build a preliminary model to predict bikeshare usage for a given hour on a specific day.
| instant | dteday | season | yr | mnth | hr | holiday | weekday | |
|---|---|---|---|---|---|---|---|---|
| Min. : 1 | Length:17379 | Min. :1.000 | Min. :0.0000 | Min. : 1.000 | Min. : 0.00 | Min. :0.00000 | Min. :0.000 | |
| 1st Qu.: 4346 | Class :character | 1st Qu.:2.000 | 1st Qu.:0.0000 | 1st Qu.: 4.000 | 1st Qu.: 6.00 | 1st Qu.:0.00000 | 1st Qu.:1.000 | |
| Median : 8690 | Mode :character | Median :3.000 | Median :1.0000 | Median : 7.000 | Median :12.00 | Median :0.00000 | Median :3.000 | |
| Mean : 8690 | NA | Mean :2.502 | Mean :0.5026 | Mean : 6.538 | Mean :11.55 | Mean :0.02877 | Mean :3.004 | |
| 3rd Qu.:13034 | NA | 3rd Qu.:3.000 | 3rd Qu.:1.0000 | 3rd Qu.:10.000 | 3rd Qu.:18.00 | 3rd Qu.:0.00000 | 3rd Qu.:5.000 | |
| Max. :17379 | NA | Max. :4.000 | Max. :1.0000 | Max. :12.000 | Max. :23.00 | Max. :1.00000 | Max. :6.000 |
| workingday | weathersit | temp | atemp | hum | windspeed | casual | registered | |
|---|---|---|---|---|---|---|---|---|
| Min. :0.0000 | Min. :1.000 | Min. :0.020 | Min. :0.0000 | Min. :0.0000 | Min. :0.0000 | Min. : 0.00 | Min. : 0.0 | |
| 1st Qu.:0.0000 | 1st Qu.:1.000 | 1st Qu.:0.340 | 1st Qu.:0.3333 | 1st Qu.:0.4800 | 1st Qu.:0.1045 | 1st Qu.: 4.00 | 1st Qu.: 34.0 | |
| Median :1.0000 | Median :1.000 | Median :0.500 | Median :0.4848 | Median :0.6300 | Median :0.1940 | Median : 17.00 | Median :115.0 | |
| Mean :0.6827 | Mean :1.425 | Mean :0.497 | Mean :0.4758 | Mean :0.6272 | Mean :0.1901 | Mean : 35.68 | Mean :153.8 | |
| 3rd Qu.:1.0000 | 3rd Qu.:2.000 | 3rd Qu.:0.660 | 3rd Qu.:0.6212 | 3rd Qu.:0.7800 | 3rd Qu.:0.2537 | 3rd Qu.: 48.00 | 3rd Qu.:220.0 | |
| Max. :1.0000 | Max. :4.000 | Max. :1.000 | Max. :1.0000 | Max. :1.0000 | Max. :0.8507 | Max. :367.00 | Max. :886.0 |
## instant dteday season yr mnth hr holiday weekday workingday
## 1 1 2011-01-01 1 0 1 0 0 6 0
## 2 2 2011-01-01 1 0 1 1 0 6 0
## 3 3 2011-01-01 1 0 1 2 0 6 0
## 4 4 2011-01-01 1 0 1 3 0 6 0
## 5 5 2011-01-01 1 0 1 4 0 6 0
## 6 6 2011-01-01 1 0 1 5 0 6 0
## 7 7 2011-01-01 1 0 1 6 0 6 0
## 8 8 2011-01-01 1 0 1 7 0 6 0
## 9 9 2011-01-01 1 0 1 8 0 6 0
## 10 10 2011-01-01 1 0 1 9 0 6 0
## weathersit temp atemp hum windspeed casual registered cnt
## 1 1 0.24 0.2879 0.81 0.0000 3 13 16
## 2 1 0.22 0.2727 0.80 0.0000 8 32 40
## 3 1 0.22 0.2727 0.80 0.0000 5 27 32
## 4 1 0.24 0.2879 0.75 0.0000 3 10 13
## 5 1 0.24 0.2879 0.75 0.0000 0 1 1
## 6 2 0.24 0.2576 0.75 0.0896 0 1 1
## 7 1 0.22 0.2727 0.80 0.0000 2 0 2
## 8 1 0.20 0.2576 0.86 0.0000 1 2 3
## 9 1 0.24 0.2879 0.75 0.0000 1 7 8
## 10 1 0.32 0.3485 0.76 0.0000 8 6 14
## 'data.frame': 17379 obs. of 17 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "2011-01-01" "2011-01-01" "2011-01-01" "2011-01-01" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ hr : int 0 1 2 3 4 5 6 7 8 9 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 6 6 6 6 6 6 6 6 6 ...
## $ workingday: int 0 0 0 0 0 0 0 0 0 0 ...
## $ weathersit: int 1 1 1 1 1 2 1 1 1 1 ...
## $ temp : num 0.24 0.22 0.22 0.24 0.24 0.24 0.22 0.2 0.24 0.32 ...
## $ atemp : num 0.288 0.273 0.273 0.288 0.288 ...
## $ hum : num 0.81 0.8 0.8 0.75 0.75 0.75 0.8 0.86 0.75 0.76 ...
## $ windspeed : num 0 0 0 0 0 0.0896 0 0 0 0 ...
## $ casual : int 3 8 5 3 0 0 2 1 1 8 ...
## $ registered: int 13 32 27 10 1 1 0 2 7 6 ...
## $ cnt : int 16 40 32 13 1 1 2 3 8 14 ...
## [1] 17379 17
From UCI’s Machine Learning Repository:
For visualization purposes, let’s convert the following attributes into factors:
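library(data.table)  # setDT() and the := assignment syntax
library(lubridate)   # ymd_hms()
setDT(wd)
# Build an hourly timestamp from dteday and hr, then convert dteday to Date
wd[, date:=ymd_hms(paste(dteday, " ", hr, ":00:00", sep=""), tz="EST")]
wd[, dteday:=as.Date(dteday)]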
wd[, hr.factored:=factor(hr)]
wd[, weathersit.factored:=factor(weathersit, levels=c(1,2,3,4),
labels=c("Ideal", "Good", "Bad", "Poor"))]
wd[, holiday.factored:=factor(holiday, labels=c("Regular Day", "Holiday"))]
wd[, workingday.factored:=factor(workingday, labels=c("Weekend or Holiday",
"Working Day"))]
wd[, weekday.factored:=factor(weekday, levels=c(0,1,2,3,4,5,6),
labels=c("SU","M","T","W","TH","F","S"))]
wd[, season.factored:=factor(season, levels=c(1,2,3,4),
labels=c("Spring","Summer","Fall","Winter"))]
wd[, mnth.factored:=factor(mnth)]
wd[, yr.factored:=factor(yr, levels=c(0,1), labels=c(2011,2012))]
According to the dataset description, the seasons are coded as 1: spring, 2: summer, 3: fall, 4: winter. If that were the case, spring and summer should have slightly more entries than fall and winter. Since the observed counts do not match that expectation, we suspect that the seasons are labeled incorrectly. We can check this by redrawing the previous plot and coloring it by season.
Looks like the seasons are off by one. Let’s address this and redraw the plots. Seasons are now correctly labeled.
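One way to make this correction is sketched below. It assumes that code 1 actually corresponds to winter, so the documented labels are shifted by one position. That label order is an assumption, not taken from the original analysis, and the small `demo` data frame is made up for illustration.

```r
# Made-up sample: one row per season code, with a month that falls in it
demo <- data.frame(mnth = c(1, 4, 7, 10), season = c(1, 2, 3, 4))

# Assumed corrected labels: the documented labels shifted by one position
demo$season.factored <- factor(demo$season, levels = 1:4,
                               labels = c("Winter", "Spring", "Summer", "Fall"))

# Cross-tabulate month against the relabeled season as a sanity check
print(table(demo$mnth, demo$season.factored))
```

Running the same cross-tabulation on the full dataset would show which months fall under each season code.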
Since we are looking at the full calendar years of 2011 and 2012, the distribution of seasons is almost uniform. Seasons indicate changes in temperature patterns and may prove useful as a feature. However, since we also have months, the season attribute may not add much.
prop.table(table(wd$workingday.factored))
##
## Weekend or Holiday Working Day
## 0.3172795 0.6827205
The first bar plot shows the expected small percentage of holiday entries. As for the second bar plot, approximately 32% of the entries fall on holidays or weekends and 68% on working days.
Table of days that are holidays:
## dteday
## 1: 2011-01-17
## 2: 2011-02-21
## 3: 2011-04-15
## 4: 2011-05-30
## 5: 2011-07-04
## 6: 2011-09-05
## 7: 2011-10-10
## 8: 2011-11-11
## 9: 2011-11-24
## 10: 2011-12-26
## 11: 2012-01-02
## 12: 2012-01-16
## 13: 2012-02-20
## 14: 2012-04-16
## 15: 2012-05-28
## 16: 2012-07-04
## 17: 2012-09-03
## 18: 2012-10-08
## 19: 2012-11-12
## 20: 2012-11-22
## 21: 2012-12-25
## dteday
## [1] "Distribution of Weather Conditions."
##
## 1 2 3 4
## 0.6567121238 0.2614649865 0.0816502676 0.0001726221
About 66% of the hourly observations are clear (Ideal), 26% are partly cloudy (Good), and 8% have rainy, snowy, or foggy weather (Bad + Poor). Bad and poor weather conditions may exhibit a negative correlation with bikeshare usage.
## [1] "Denormalized Temperature Summary, Fahrenheit"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 19.00 46.00 60.00 59.66 73.00 102.00
## [1] "Denormalized Adjusted Temperature Summary, Fahrenheit"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.00 43.00 61.00 59.72 77.00 122.00
## [1] "Denormalized Humidity Summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 48.00 63.00 62.72 78.00 100.00
## [1] "Denormalized Windspeed Summary"
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 7.002 13.000 12.740 17.000 57.000
Note: The weather attributes above have been denormalized to their appropriate values.
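The denormalization step itself is not shown, so here is a minimal sketch of how it could be done. The scaling constants are assumptions based on the UCI dataset description as I understand it: `temp` is min-max scaled between -8 and 39 ºC, `atemp` between -16 and 50 ºC, `hum` is divided by 100, and `windspeed` is divided by 67. The helper names `denormalize` and `to_fahrenheit` are hypothetical. These constants do reproduce the minimum and maximum values in the summaries above.

```r
# Convert Celsius to Fahrenheit
to_fahrenheit <- function(celsius) celsius * 9 / 5 + 32

# Undo min-max scaling: x_norm = (x - x_min) / (x_max - x_min)
denormalize <- function(x_norm, x_min, x_max) x_norm * (x_max - x_min) + x_min

# Extremes of the normalized columns, taken from the summary table
demo <- data.frame(temp = c(0.02, 1), atemp = c(0, 1),
                   hum = c(0, 1), windspeed = c(0, 0.8507))

# Assumed scaling constants (see the note above); rounding is for comparison only
demo$temp.denormed      <- round(to_fahrenheit(denormalize(demo$temp, -8, 39)))
demo$atemp.denormed     <- round(to_fahrenheit(denormalize(demo$atemp, -16, 50)))
demo$hum.denormed       <- demo$hum * 100
demo$windspeed.denormed <- demo$windspeed * 67
print(demo)
```

The same column assignments would apply to the full `wd` table.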
The measured temperature in Washington, D.C. exhibits a bimodal distribution, which is consistent with distinct cold and warm seasons.
Compared to the measured temperature, the adjusted temperature is more spread out. Adjusted temperature readings are a function of temperature, humidity, precipitation, and windspeed, and because of this I hypothesize that this attribute will have the strongest correlation with bikeshare usage.
There is great variation in humidity. I assume it will also correlate with bikeshare usage.
The majority of windspeed readings lie between 7 mph and 17 mph. This low variation suggests that windspeed will not affect bikeshare usage drastically. Outliers in windspeed also indicate unfavorable weather conditions, which is what the weathersit attribute attempts to capture.
With regard to weather readings, I hypothesize that the adjusted temperature readings will capture most of the correlation with bikeshare usage because they are a function of temperature, humidity, and windspeed. However, with temperature, humidity, and windspeed present, there might not be a need to include adjusted temperature in the model.
Looks like Washington, D.C. experiences the typical four seasons according to the temperature plots above. Feeling temperature has much more variability than actual temperature. It is quite possible that bikeshare usage may be more correlated with feeling temperature.
There also seems to be an anomaly in the feeling temperature between 2012-07 and 2012-10, where one or more readings are much lower than the norm. Although it might not make a difference when calculating correlation, it’s better to correct those observations in our dataset.
Adjusted temperature plot after the anomaly has been corrected:
# Find the date where the anomaly occurs
unique(wd[wd$dteday > as.Date("2012-07-01") &
wd$dteday < as.Date("2012-10-01") &
wd$atemp.denormed < 37, .(dteday)])
## dteday
## 1: 2012-08-17
# Check if atemp is bad for the whole day
wd[wd$dteday == as.Date("2012-08-17"), .(temp.denormed, atemp.denormed)]
## temp.denormed atemp.denormed
## 1: 75 32
## 2: 73 32
## 3: 73 32
## 4: 72 32
## 5: 72 32
## 6: 72 32
## 7: 70 32
## 8: 72 32
## 9: 75 32
## 10: 77 32
## 11: 80 32
## 12: 82 32
## 13: 87 32
## 14: 89 32
## 15: 90 32
## 16: 90 32
## 17: 89 32
## 18: 87 32
## 19: 87 32
## 20: 80 32
## 21: 79 32
## 22: 75 32
## 23: 73 32
## 24: 72 32
## temp.denormed atemp.denormed
# atemp is bad for the whole day, so replace with temp values
my_temp <- wd[wd$dteday == as.Date("2012-08-17")]$temp
my_temp.denormed <- wd[wd$dteday == as.Date("2012-08-17")]$temp.denormed
wd[wd$dteday == as.Date("2012-08-17"), atemp := my_temp]
wd[wd$dteday == as.Date("2012-08-17"), atemp.denormed := my_temp.denormed]
ggplot(wd,aes(x=date, y=atemp.denormed)) +
geom_point(aes(colour=atemp.denormed)) +
scale_colour_gradient2(name="Temp ºF", low = "blue", mid = "green" ,
high = "red", midpoint = 60) +
geom_smooth(color = "red",size = 1) +
scale_y_continuous(breaks=seq(0,125,25)) +
ggtitle ("Hourly Feeling Temperature") +
xlab("Date") + ylab ("Hourly Feeling Temperature ( ºF )")
There is huge variation in humidity on a daily basis, with an upward trend in humidity starting in the summer and downward trend starting around mid autumn.
Windspeed also varies on a daily basis, with an upward trend in windspeed during Winter and Spring and a downward trend in Summer and Fall. There are also some days with windspeed greater than 50 mph.
We have examined temperature, adjusted temperature, humidity, and windspeed in detail for Washington, D.C. over a period of two years. It will be interesting later on to see how these attributes affect bikeshare usage as a daily average or when attributes such as time of day and day of the week are held fixed.
Here we will analyze what role the suspected attributes play in bikeshare usage. Since temperature was just discussed in great detail, let’s start by looking at hourly bikeshare usage and weather.
The distribution for 2012 differs markedly from that for 2011, which suggests that usage grew through demand and popularity, independent of weather conditions.
Months give a finer resolution of how seasons affect bikeshare usage.
The spread in ridership within each hour shows that factors other than the hour itself contribute to ridership. There are two peak times, 8 am and 5 pm. The riders at these times are probably people who work in the city.
By faceting between whether it’s a working day or not, we have identified that different peak times exist.
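The peak hours per day type can also be read off numerically rather than from the plot. The sketch below uses a small made-up sample (the counts are illustrative, not taken from the dataset) and base R only: it averages `cnt` by hour and working-day flag, then picks the hour with the highest mean for each day type.

```r
# Made-up hourly sample: two rows per (hour, workingday) combination
demo <- data.frame(
  hr         = c(8, 8, 13, 13, 8, 8, 13, 13),
  workingday = c(1, 1, 1, 1, 0, 0, 0, 0),
  cnt        = c(400, 420, 150, 170, 90, 110, 300, 320)
)

# Mean ridership for each (hour, working day) combination
hourly <- aggregate(cnt ~ hr + workingday, data = demo, FUN = mean)

# For each day type, keep the hour with the highest mean ridership
peaks <- do.call(rbind, lapply(split(hourly, hourly$workingday),
                               function(d) d[which.max(d$cnt), ]))
print(peaks)
```

On the real data, the same aggregation over `hr` and `workingday` would give the numbers behind the faceted plot.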
Bikeshare usage across the days of the week separates into weekends and weekdays, which reaffirms the use of the workingday attribute.
In the dataset there are two variables: holiday and workingday. Holiday indicates whether the day is a holiday or not. Workingday indicates whether it is a working day or not. As we can see in the plots above, there is a decrease in bikeshare usage if it is a holiday. There is also a decrease in bikeshare ridership if it is not a working day.
To get a better look at how the weekend by itself affects bikeshare usage, let’s combine the two attributes into an attribute called type_of_day with the following mappings:
Holidays show slightly less bikeshare usage than weekends, but it may not be enough to be significant. There seems to be more bikeshare usage on working days.
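A minimal sketch of how `type_of_day` could be derived from the two flags follows. The rule is an assumption inferred from the three categories named in this analysis (holiday, weekend, working day): the holiday flag takes precedence, then the working-day flag, and everything else counts as a weekend. The `demo` rows are made up.

```r
# Made-up sample covering the three cases
demo <- data.frame(holiday = c(0, 0, 1), workingday = c(1, 0, 0))

# Assumed rule: holiday wins, then workingday; everything else is a weekend
demo$type_of_day <- factor(
  ifelse(demo$holiday == 1, "Holiday",
         ifelse(demo$workingday == 1, "Working Day", "Weekend")),
  levels = c("Working Day", "Weekend", "Holiday")
)
print(demo)
```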
Random forest will be used to find the relationship between bikeshare usage and the given attributes, because of the cyclical, non-linear relationships observed in the analysis.
The year, month, hour, type of day (holiday, weekend, or working day), temperature, humidity, and windspeed will be used as features.
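The exact model call is not shown here, so the following is only a sketch of what a fit with these features might look like using the `randomForest` package. The data is synthetic (`cnt` is simulated to depend on hour and temperature), and `ntree = 200` is an arbitrary choice. The fit is skipped if the package is not installed.

```r
# Synthetic data only: cnt is simulated so that it depends on hour and temperature
set.seed(42)
n <- 500
demo <- data.frame(
  yr          = factor(sample(c(2011, 2012), n, replace = TRUE)),
  mnth        = sample(1:12, n, replace = TRUE),
  hr          = sample(0:23, n, replace = TRUE),
  type_of_day = factor(sample(c("Working Day", "Weekend", "Holiday"), n, replace = TRUE)),
  temp        = runif(n),
  hum         = runif(n),
  windspeed   = runif(n)
)
demo$cnt <- pmax(0, round(100 + 80 * sin(2 * pi * demo$hr / 24) +
                          150 * demo$temp + rnorm(n, sd = 10)))

# Fit only if the randomForest package is available
if (requireNamespace("randomForest", quietly = TRUE)) {
  fit <- randomForest::randomForest(
    cnt ~ yr + mnth + hr + type_of_day + temp + hum + windspeed,
    data = demo, ntree = 200, importance = TRUE
  )
  # One row of importance measures per feature
  print(randomForest::importance(fit))
}
```

With `importance = TRUE`, the importance table is what supports statements about which attributes matter most.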
Only one temperature attribute will be used because the two temperature attributes are highly correlated with each other. More rigorous feature selection can be done later on to determine what is better suited.
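The correlation between the two temperature attributes can be checked directly. As a small illustration, the sketch below uses only the ten rows printed at the top of this report, which is far too few to be conclusive. On the full table the equivalent call would be `cor(wd$temp, wd$atemp)`.

```r
# temp and atemp for the first ten rows shown in the printed data above
temp  <- c(0.24, 0.22, 0.22, 0.24, 0.24, 0.24, 0.22, 0.20, 0.24, 0.32)
atemp <- c(0.2879, 0.2727, 0.2727, 0.2879, 0.2879,
           0.2576, 0.2727, 0.2576, 0.2879, 0.3485)

# Pearson correlation between the two temperature attributes
r <- cor(temp, atemp)
print(r)
```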
What stands out here is that random forest ranks the year attribute as very important. This supports the previous observation that bikeshare usage by both registered and casual users is growing year over year. Let’s see whether fitting each year separately changes the importance of the relationships between the attributes and bikeshare usage.
We can see that for both years, hour and temperature play a big role in bikeshare usage. In 2012, temperature seems to play more of a role in bikeshare usage. The importance of type of day is also stronger in 2012. This may suggest a change in trends due to the increased bikeshare usage in 2012.
Bikeshare usage is most affected by the hour of the day. There are more people using the bikeshare program during rush hour on the weekdays and around the afternoon on the weekends.
Bikeshare usage increased significantly from 2011 to 2012. What is also interesting is that the distributions for 2011 and 2012 are very different, with 2012 seeing more users on average as well as higher outliers.
Temperature is also one of the major factors influencing bikeshare usage. Usage is lower around freezing temperatures, suggesting that it may be too cold and somewhat dangerous to ride in those conditions. Hot weather doesn’t seem to reduce bikeshare usage, but it does not increase usage relative to warm temperatures either.
Overall, it seems that usage of the Capital Bikeshare program is increasing, and that the community uses it for both commuting and leisure. Something to keep an eye on is the growth rate, since continued growth would mean that more bikes need to be bought to keep up. It would be interesting to see how bikeshare programs in other cities with different climates compare with the Capital Bikeshare program’s success.
I chose to investigate the Capital Bikeshare program to solidify my understanding of analyzing attributes for the purpose of linear regression analysis. What I didn’t expect at the beginning was the cyclical patterns that exist in multiple attributes. I started off using linear models to find relationships, but I quickly learned that they are not the model of choice for a feature set with cyclical patterns. Through some research and due diligence I discovered that random forest was a quick way to find relationships between bikeshare usage and the given attributes. Another time-consuming lesson was that it is important to narrow down what you want to explore in order to finish a project like this.